Peter Bak, University of Konstanz, bak@dbvis.inf.uni-konstanz.de
Christian Rohrdantz, University of Konstanz, rohrdantz@dbvis.inf.uni-konstanz.de
Svenja Leifert, University of Konstanz, svenja.leifert@uni-konstanz.de
Christoph Granacher, University of Konstanz, christoph.granacher@uni-konstanz.de
Stefan Koch, University of Konstanz, stefanmoritzkoch@googlemail.com
Simon Butscher, University of Konstanz, simon.butscher@uni-konstanz.de
Patrick Jungk, University of Konstanz, patrick.jungk@uni-konstanz.de
VAT – Video Analysis Tool, developed at the University of Konstanz, 2009
KNIME – Data Analysis and Visualization Tool, University of Konstanz: www.knime.org
Pajek, Network analysis program: http://pajek.imfm.si/doku.php
Videos:
Traffic Video: traffic-video.wmv
Flitter Video: Flitter-Video.wmv
Video: Video-Video.wmv
ANSWERS:
GC.1: Please describe the
scenario supported by your analysis of the three mini-challenges in a Debrief.
An employee of the US Embassy in Flovania is leaking information to a criminal organization. The employee with the staff ID 30 is considered suspicious for several reasons. We have evidence that he/she sent large amounts of data to the external IP address 100.59.151.133 from different computers in the embassy belonging to fellow colleagues. He/she apparently took advantage of the absence of colleagues and used their computers for his/her criminal activity. In some cases the absence of these colleagues from their workplaces during the transactions to the mentioned IP address is clearly documented: we know this from the entry logs, which show that they either had not yet arrived at the office in the morning or were traceably in the classified area. In the remaining cases of transactions to the suspicious IP address, we could derive evidence of their absence from their network traffic behavior: gaps in their otherwise continuous network traffic suggest that they were taking a break and might therefore have left their offices, and even though the employees are not required to log out of the building, it can also be inferred from their network traffic when they stopped working in the evenings. For every transaction to the IP 100.59.151.133, at least one of the above indications for the absence of the computer owner was present, and in almost every case the respective roommate was apparently absent, too. This does not hold for two of the earliest suspicious transactions, on the 8th and 15th of January: the employee with ID 30 was in the office and active while the computer of his/her roommate 31 was abused for criminal activities. This is only one of several indications for the guilt of employee 30. While studying the network traffic behavior of the absent colleagues, we learned that both the amount of traffic produced and the destination IP were unusual for them. The amount of data sent is especially salient: among the 16 network traffic events with the highest request sizes during the whole month of January, 13 were connections to the suspicious IP. This makes us confident that the identified IP address was the one used for leaking information.
We also assessed the ‘alibi’ of all employees during the suspicious criminal activity and found that only two employees, employee 30 and employee 27, have no alibi for any of the times when data was sent to the mentioned IP address. The second suspect, with the staff ID 27, was finally exonerated, because the first four criminal activities were conducted from the computer of employee 30's roommate while employee 30 was active in their shared office. It is also evident that he/she repeatedly used the office neighbor's computer for the criminal activities in the beginning, and then started to use other employees' computers in order to cover his/her tracks.
The suspect shows further systematic behavior: he/she always sent the large amounts of data to the same external IP address, every week on Tuesdays and Thursdays. He/she started with a single transaction on January 8th, continued sending twice a day (10th, 15th, 17th and 22nd), and finally three times a day (29th and 31st of January). Seemingly he/she was either forced to raise the amount of material transmitted or, alternatively, felt more secure. In addition, the network traffic of employee 30 always went up significantly within 1 to 2 minutes after the suspicious activity. This high traffic goes to unsuspicious IP addresses, which makes us believe that it is a diversionary tactic.
We were also able to uncover the communication channel this suspicious employee (ID 30) used to reach the criminal organization. The social networking/micro-blogging tool Flitter was used for the communication with three handlers. In Flitter the employee has the ID 100 and the handlers have the IDs 194, 261 and 563. The handlers in turn communicate with a person code-named “Boris”, with Flitter ID 4994. Boris has direct contact to the fearless leader (Flitter ID 4) of the criminal organization. We are quite certain about this configuration, since all other possible scenarios could be discarded. The employee and the three handlers live in Prounov, the second largest city of Flovania, which lies close to the capital city Koul. Boris lives in Kannvic, in the east, and the fearless leader in Kouvnic, in the north of the country. The international contacts of the leader are distributed over all surrounding countries (Tulamuk in Trium, Otello in Posana and Transpasko in Transak). The fact that the employee and the handlers live in the same city makes us believe that they probably also met in person. On these occasions money, information or even objects may have been exchanged. Since the number of transmissions rose to three per occasion from the 24th of January on, it is reasonable to believe that the compensation rose as well. This hypothesis is backed up by suspicious events that we detected during the analysis of the surveillance videos.
The analysis of the surveillance video from public places near the embassy revealed a number of suspicious events. We declared events as suspicious when two persons met or a person approached a vehicle. Among the large number of events, we looked closely at those within the time frame of the activities uncovered by the network traffic analysis. We gathered video data from the 24th and 26th of January. On the morning of the 24th there were several suspicious events of two persons meeting. Some of these events coincide with time ranges when employee 30 was inactive. In particular: employee 30 arrived at the office at 7:47am and started working at 8:05am. During this gap, a suspicious meeting took place at location 2 in the surveillance video, starting at 8:00am and lasting for 1 minute. The next gap in his/her network traffic occurred between 8:06am and 9:09am, while he/she logged out of the classified area at 9:00am without having logged in! During this time period, four suspicious events were recorded by the surveillance video: at 8:14am for 1:47 minutes and at 8:37am for 1:11 minutes at location 2, at 8:41am for 1 minute at location 4, and at 8:43am for 1 minute at location 3. Each of these events shows a meeting of two persons. Later, between 12:09 and 12:33, he/she was logged in to the classified area. During this time period two suspicious video events occurred, at 12:21 for 23 seconds and at 12:31 for 35 seconds. As the employee has already proved unreliable in logging in and out of the classified area, we believe that he/she might have faked being in the classified area. For the remaining suspicious events in the video it is also possible that employee 30 left his/her workplace for a short time. In any case, he/she is not active during these events, but the questions remain how he/she managed to enter the embassy without logging in and how he/she could manipulate the logging system of the classified area. In the former case, he/she might have piggybacked, as other employees entered the building regularly during the morning.
Although the 26th of January was a Saturday, which is not a working day for the embassy employees, we detected several suspicious events on that day as well. On such a day the suspicious employee could easily have met a handler at a public location.
GC.2: Who are the major
players in the scenario and what are their relationships?
We conducted our analysis on three parallel tracks. The first track investigated the suspicious computer use at the embassy. The second track investigated the Flitter data, providing information about the social communication network of the criminal organization. The third track investigated the surveillance video data recorded in the surroundings of the embassy. These tracks are described in detail in the following subsections.
Network Traffic Analysis
The most characteristic trait of suspicious computer use we found is that the guilty employee uses the PCs of workmates who are absent from their offices.
Our process is a form of the KDD pipeline (see Figure 1) with three main iterative phases. The data needed to be prepared (Data Preparation) and analyzed with programs or visual analytics tools (Interaction). Then we could draw conclusions and gather new information (Knowledge).
Figure 1: Network Traffic Analysis - Pipeline
In the Data Preparation phase we computed minutes per day and minutes per month. The proxLog and IPLog data tables were joined into an Overview table containing the employee IDs, types and time components, where “types” corresponds to the “Type” column of the proxLog dataset and the “Socket” column of the IPLog.
We needed about one hour to prepare the data, as we had to undertake many small steps like splitting strings, changing data types, etc.
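As an illustration of these preparation steps, the following minimal Java sketch derives the two time components we later plotted from a single log record. The record layout and timestamp format are assumptions for the sketch, not the exact contest data format.

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical record layout: employee ID; timestamp; event type.
public class TimeComponents {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm");

    public static void main(String[] args) {
        String row = "30;2008-01-08 16:04;traffic";   // assumed format
        String[] fields = row.split(";");             // the string-splitting step
        LocalDateTime t = LocalDateTime.parse(fields[1], FMT);
        int minuteOfDay = t.getHour() * 60 + t.getMinute();
        int minuteOfMonth = (t.getDayOfMonth() - 1) * 24 * 60 + minuteOfDay;
        System.out.printf("id=%s minuteOfDay=%d minuteOfMonth=%d type=%s%n",
                fields[0], minuteOfDay, minuteOfMonth, fields[2]);
    }
}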
In later iterations, this phase only consisted of different filterings or selections of the data (e.g. looking at IDs separately).
Everything except the joining of the proxLog and IPLog data tables was done semi-automatically: the preprocessing steps had to be identified manually and were then carried out by the system on the data tables.
We could then begin to search for
anomalies.
Plotting minutes per month against minutes per day gives a good overview of each person's data traffic (see Figure 2). Colors are mapped to request sizes, and red squares symbolize the largest amounts. This gave us a few suspicious IDs but no definite results. However, we encountered the same IDs in different situations later.
Another approach was plotting the Overview table for each ID with the minutes per month/minutes per day overview and colors mapped to the type (blue=data traffic, green=prox-in-building, red=prox-in-classified, yellow=prox-out-classified; see Figure 3). In several cases, a blue square appears between a red and a yellow one, which means the ID's PC was used while he/she was in the classified area.
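The rule behind this visual pattern can be condensed into a small check; the following sketch uses a simplified event encoding of our own and flags any traffic event of a PC that falls between its owner's prox-in-classified and prox-out-classified events.

import java.util.ArrayList;
import java.util.List;

public class ClassifiedAreaCheck {
    // Simplified event: time as minute of month, type as plain string.
    record Event(int minuteOfMonth, String type) {}

    // Returns all traffic events of a PC that occur while its owner
    // is in the classified area; events must be sorted by time.
    static List<Event> trafficWhileInClassified(List<Event> sortedEvents) {
        List<Event> hits = new ArrayList<>();
        boolean inClassified = false;
        for (Event e : sortedEvents) {
            switch (e.type()) {
                case "prox-in-classified"  -> inClassified = true;
                case "prox-out-classified" -> inClassified = false;
                case "traffic"             -> { if (inClassified) hits.add(e); }
            }
        }
        return hits;
    }
}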
To verify the suspicious moments we had found manually, we wrote program 1 to support our findings (see first part, first four rows, Figure 4). It discovered two IDs that logged into the classified area without logging out later (ID 38 on the 4th at 13:12, ID 49 on the 8th at 12:56). We then wrote program 2, which detected three further exceptions: ID 30 logged out without logging in before (on the 10th at 10:33, the 17th at 11:31 and the 24th at 9:00).
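Programs 1 and 2 essentially pair up classified-area log-ins and log-outs per employee and report the leftovers. Our originals were two separate programs; a condensed sketch of the combined check could look as follows.

import java.util.List;

public class ProxLogAnomalies {
    // events: one employee's classified-area events in time order,
    // true = log-in, false = log-out.
    static void reportUnmatched(int id, List<Boolean> events) {
        int open = 0;
        for (boolean login : events) {
            if (login) {
                open++;
            } else if (open > 0) {
                open--;
            } else {
                // the case program 2 detected (e.g. ID 30)
                System.out.println("ID " + id + ": log-out without prior log-in");
            }
        }
        if (open > 0) {
            // the case program 1 detected (e.g. IDs 38 and 49)
            System.out.println("ID " + id + ": log-in without later log-out");
        }
    }

    public static void main(String[] args) {
        reportUnmatched(30, List.of(false, true, false)); // out, in, out
    }
}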
In this part of the process, everything but the detection of anomalies in the plots was achieved automatically; it took about two and a half hours.
Figure 2: Scatter plot overview of employees’ data traffic.
Figure 3: Scatter plot overview of employees’ behavior: blue=data traffic, green=prox-in-building, red=prox-in-classified, yellow=prox-out-classified.
All suspicious data traffic we found had the destination IP 100.59.151.133, suggesting that all traffic to this address is suspicious (see first three columns, Figure 4).
We took a closer look at these
occasions.
Looking at the owners of the suspicious PCs, their office neighbors and their (probable) behavior during the times suspicious data traffic occurred (manually, using the minutes per month/minutes per day plots), we found that it was possible to account for the employees' absence in most cases (see Figure 4). As the traitor does not want to be detected, he/she would not have used an office where anyone was present. However, on two occasions ID 30 is there and active while his/her neighbor's PC is used.
Being in the classified area is quite a good alibi, so we counted (manually, for all IDs) the cases in which an employee had been in the classified area while suspicious data traffic occurred (see upper half, Figure 5). Only IDs 27 and 30 never have alibis, which makes them highly suspicious. However, since it is not impossible to sneak into or out of the classified area, this alone did not give us definite results.
ID 30's behavior on the 8th and 15th led us to count (again manually) in how many cases each employee had been active in a one- and two-minute interval around the suspicious data traffic (see lower half, Figure 5). We could see that ID 30 was extremely active in the two-minute intervals (nearly twice as active as any other employee) and concluded that he/she had tried to “fake” his/her presence in his/her own office by producing data traffic shortly after leaking confidential information. Now, how could ID 30 know when which office was empty? A look at the office plan reveals that office 15 (IDs 30 and 31) offers a good view over most of the affected offices and the corridor to the classified area.
This took us about 2 hours.
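A rough sketch of this interval counting, assuming event times are given as minutes of the month (as above) and a window of 1 or 2 minutes:

import java.util.List;

public class WindowActivity {
    // Counts suspicious transmissions near which the employee's own PC
    // produced traffic; both lists hold minutes of the month.
    static int activeNear(List<Integer> ownTraffic, List<Integer> suspicious, int window) {
        int hits = 0;
        for (int s : suspicious) {
            for (int t : ownTraffic) {
                if (Math.abs(t - s) <= window) { hits++; break; }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Integer> own = List.of(100, 250, 400);
        List<Integer> susp = List.of(101, 500);
        System.out.println(activeNear(own, susp, 2)); // prints 1
    }
}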
Figure 4: Listing of suspicious behavior.
We can now identify clear patterns in ID 30's behavior.
ID 30 has short gaps in his/her own data traffic shortly before each occurrence of suspicious data traffic, but is often active again once the malicious transmission is done. We suspect he/she prepared some kind of data traffic on his/her own PC beforehand.
Apart from that, he/she began slowly with one transmission per day, then two, later three. Three of the first five transmissions were even carried out from his/her own office, but he/she became more careful later and used different offices.
If one divides the suspicious data traffic as in Figure 5, into one group for PCs that were used while their owners were in the classified area and one for the rest, a clear pattern becomes visible. While the events of the first group are spread over the whole day, the others mainly take place in the morning and evening, when many employees have not yet arrived or have already left.
Furthermore, all data was sent on
Tuesdays and Thursdays.
Here, short looks at different plots (manually, all in all less than half an hour) were enough to detect anomalies, while a deeper investigation of ID 30's data traffic compared to the rest did not yield any results beyond those already mentioned: the transmissions to 100.59.151.133.
Figure 5: Final listing of suspicious behavior, including activity 1-2 minutes before and after each criminal act.
Social Network and Geographic Analysis
(Flitter Data)
The Flitter data was analyzed using a visual analytics approach, as described below.
Figure 6: The pipeline we used for our analysis: first, a data selection and aggregation is made; after that, an iterative visualization approach follows.
Selection and Preprocessing
We started our analysis by getting familiar with the data and writing down the constraints for every scenario, breaking them up into parts that we judged as necessary, possible or merely speculative. The data was inserted into a MySQL database using Navicat Lite. Then, using a small PHP script that we wrote in about an hour, we designed an aggregated table with all the given connection information, e.g. the exact geo-location on the map and the connection count of each user. The connection data itself was loaded into Pajek using the txt2pajek helper tool.
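The aggregation itself is plain degree counting; a JDBC sketch equivalent to what our PHP script did might look like this. The schema links(from_id, to_id) and the credentials are assumptions for the sketch, not our actual database layout.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class ConnectionCounts {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection(
                     "jdbc:mysql://localhost/flitter", "user", "password"); // assumed credentials
             Statement st = c.createStatement();
             // links(from_id, to_id) is an assumed table layout
             ResultSet rs = st.executeQuery(
                     "SELECT from_id, COUNT(*) AS degree FROM links GROUP BY from_id")) {
            while (rs.next()) {
                System.out.println(rs.getInt("from_id") + " -> " + rs.getInt("degree"));
            }
        }
    }
}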
We initially visualized the complete graph using Pajek's force-directed layout algorithms and started to reduce the network into a Pajek partition in which vertices are colored according to their connection count. The result was still a very cluttered view, so we decided to apply more constraints to get rid of useless information.
To do that, we first defined four classes (employee, handler, middleman and fearless leader) and assigned persons to the classes according to their connection counts. Based on these classes we added further constraints, first with SQL statements; later we developed a lightweight Java tool to structure the process of adding constraints.
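A minimal sketch of this class assignment: the middleman (4-5 contacts) and leader (over 100 contacts) bounds follow the scenario constraints discussed below, while the employee and handler bounds shown here are mere placeholders, not the actual values we used. In the real data, candidate ranges can overlap, so a person may be a candidate for several classes; the sketch simply checks employee first.

public class RoleClassifier {
    enum Role { EMPLOYEE, HANDLER, MIDDLEMAN, LEADER, OTHER }

    // Placeholder bounds; the real values came from the written-down
    // scenario constraints.
    static final int EMPLOYEE_MIN = 36, EMPLOYEE_MAX = 44;
    static final int HANDLER_MIN = 28, HANDLER_MAX = 42;

    static Role classify(int degree) {
        if (degree > 100) return Role.LEADER;                  // fearless leader: over 100 contacts
        if (degree >= 4 && degree <= 5) return Role.MIDDLEMAN; // scenario A middleman: 4-5 contacts
        if (degree >= EMPLOYEE_MIN && degree <= EMPLOYEE_MAX) return Role.EMPLOYEE;
        if (degree >= HANDLER_MIN && degree <= HANDLER_MAX) return Role.HANDLER;
        return Role.OTHER;
    }
}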
Visual Analytics Approach
At first the analysis was lead by
the idea to concentrate on the scenario with more information available and
easier constraints which is clearly scenario A. It appeared that scenario B was
not supported directly by the data considering the fixed values for connection
information of the middlemen (which would be 2-3 contacts). The only
possibility according to scenario B was that the middlemen have contact to more
than one of the handlers.
We concentrated on scenario A first and used the given constraints to reduce the dataset. The critical point was to check which user of the class employee had connections to at least 3 persons of the class handler, and whether all of these handlers had contact to someone with 4-5 contacts. The person with the code name Boris also had to have contact to the fearless leader, who has a connection count of over 100.
We wrote our Java tool in an iterative process, which took us about 6 hours. In each step of the process we added a new constraint and then visualized the results with the help of Pajek. Some constraints, e.g. that the handlers are not allowed to communicate among themselves, were not included, because this could easily be seen in the visualization. As a result we got exactly one network that matched the given constraints of scenario A.
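The constraint chain our tool iterated over stepwise can be condensed into one check; the sketch below is a simplification of that process, assuming roles have been precomputed as in the classification sketch above.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ScenarioACheck {
    enum Role { EMPLOYEE, HANDLER, MIDDLEMAN, LEADER, OTHER }

    // contacts: adjacency sets per Flitter ID; roles: precomputed class per ID.
    static boolean matchesScenarioA(int employee,
                                    Map<Integer, Set<Integer>> contacts,
                                    Map<Integer, Role> roles) {
        Set<Integer> handlers = new HashSet<>();
        for (int c : contacts.getOrDefault(employee, Set.of()))
            if (roles.getOrDefault(c, Role.OTHER) == Role.HANDLER) handlers.add(c);
        if (handlers.size() < 3) return false;   // needs at least 3 handlers

        // all handlers must share one middleman (4-5 contacts)
        Set<Integer> common = null;
        for (int h : handlers) {
            Set<Integer> mids = new HashSet<>();
            for (int c : contacts.getOrDefault(h, Set.of()))
                if (roles.getOrDefault(c, Role.OTHER) == Role.MIDDLEMAN) mids.add(c);
            if (common == null) common = mids; else common.retainAll(mids);
        }
        if (common == null || common.isEmpty()) return false;

        // the shared middleman ("Boris") must contact a leader (over 100 contacts)
        for (int m : common)
            for (int c : contacts.getOrDefault(m, Set.of()))
                if (roles.getOrDefault(c, Role.OTHER) == Role.LEADER) return true;
        return false;
    }
}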
The next step was to add tool support for scenario B. We checked again which user of the class employee had connections to at least 3 persons of the class handler. But this time it was possible that each handler has his own middleman with 2-4 contacts. These middlemen had to have contact to one potential leader. In the end we saw no evidence in the data that scenario B would match.
By mapping the network structure onto the map of Flovania, we realized that the fearless leader did not live in a larger city. But because this geospatial implication was mentioned in the task description, we decided to validate the result again.
To do so, we used SQL statements and visualizations. We started again by looking at which employees have connections to at least 3 handlers. This left only 13 potential employees. Then we queried for the connections to potential handlers, middlemen and leaders and visualized the result set for each potential employee separately.
Figure 7 shows the visualization for the employee with the ID 19. This network structure nearly matches the constraints of scenario B: four handlers are connected to one employee. The drawback is that there are only two handlers whose two middlemen have contact to one leader. In our analysis we found no fully matching structure for scenario B at all.
Figure 7: Network structure of the employee with ID 19.
By visualizing the network for the employee with ID 100 (Figure 8) it is easy to see that it fits the network structure of scenario A: one employee is connected to 3 handlers, and they are connected to one middleman, who is related to the leader. This is the only matching structure we found in the data. This mostly manual analysis of the 13 candidate employees took us about 2 hours.
Figure 8: Network structure of the employee with ID 100.
Result
To visualize our final result we took the detected employee, the three handlers, the middleman and the fearless leader and queried for all connections between these persons. We also added all international contacts of the fearless leader and the contact of the middleman Boris to a not yet mentioned member of the organization. Figure 9 shows our final network, which seems to be the best match for the task.
Figure 9 - The complete resulting
network of the criminal organization.
We believe that the person with ID 100 is the employee and the persons with the IDs 194, 261 and 563 are his handlers. As the three handlers have contact with only one person from the group of persons with 4 or 5 contacts, this person has to be the middleman Boris, who has the ID 4994. Boris in turn has only one contact in the group of persons with over 100 contacts: the person with ID 4. This person seems to be the fearless leader. All these IDs were obtained with the help of our own tool. Furthermore, we found one more person linked with Boris, so it is very probable that the person with ID 1612 is also a member of the organization.
Video Analysis
1. Assumptions
To identify any events of potential counter-intelligence/espionage interest, a definition of such a suspicious event needs to be given. The following events were defined as suspicious:
These events need to be described in a formal way as behavioral patterns. To recognize such an event, the following dimensions need to be considered as well:
Suspicious items are as follows:
Suspicious areas are as follows:
These areas specify the areas of interest. In order to determine events, items within an area have to be recognized. Therefore, areas of movement within the video need to be detected, since every moving object may indicate suspicious deeds. These areas of movement are to be marked and classified (see Figure 10).
Figure 10: Classification needs interactive user involvement
The following types are considered potentially suspicious and need to be determined:
All other moving areas are not considered suspicious. This means those areas are irrelevant and can be excluded.
To analyze the video data, an interactive process based on the KDD (Knowledge Discovery in Databases) pipeline was used, as shown in Figure 11.
Figure 11: KDD pipeline (Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P.: From Data Mining to Knowledge Discovery: An Overview. Advances in Knowledge Discovery and Data Mining (1996), 1-34; http://www.aaai.org/aitopics/assets/PDF/AIMag17-03-2-article.pdf)
Following this terminology, a
flow chart (Figure 12) was created describing the operative steps required to
conduct a successful analysis of video stream data.
Figure 12: Analysis process of video data
In order to extract the data from the video, thresholds have to be set manually by the user. The most important thresholds are as follows:
2.1 Determination of bounding boxes
As the result of the determination chain, the bounding boxes inside a frame are determined. The tool's frame preview shows the result of the automatic determination of bounding boxes. The color bar below the frame preview shows the count of bounding boxes over time; each line stands for one location, starting with the first one. The lighter the color, the more bounding boxes were found in that time slot. Each movement of the camera position indicates a change of location. This yields the location information relevant for the result.
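A greatly simplified sketch of the threshold step, under assumed parameters: pixels whose grey value changes more than the user-set threshold between consecutive frames count as movement. VAT groups such pixels into per-object boxes; here, a single box around all changed pixels stands in for that grouping.

public class MotionBox {
    // prev and cur hold grey values [0..255] of two consecutive frames,
    // indexed [y][x]; returns {minX, minY, maxX, maxY} or null if no motion.
    static int[] changedRegion(int[][] prev, int[][] cur, int threshold) {
        int minX = Integer.MAX_VALUE, minY = Integer.MAX_VALUE, maxX = -1, maxY = -1;
        for (int y = 0; y < cur.length; y++) {
            for (int x = 0; x < cur[y].length; x++) {
                if (Math.abs(cur[y][x] - prev[y][x]) > threshold) { // user-set threshold
                    minX = Math.min(minX, x);
                    maxX = Math.max(maxX, x);
                    minY = Math.min(minY, y);
                    maxY = Math.max(maxY, y);
                }
            }
        }
        return (maxX < 0) ? null : new int[] { minX, minY, maxX, maxY };
    }
}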
2.2 Classification of Bounding Boxes
The process of classification is
an interactive process divided into two sub processes.
No. | Name | Color | R | G | B
1 | human | green | 77 | 157 | 74
2 | two humans | orange | 255 | 127 | 0
3 | car | red | 228 | 26 | 28
4 | two cars | blue | 126 | 126 | 184

Table 1: Colors mapped to the classified bounding boxes for visualization.
This training data is used for the next step (Multi Layer Perceptron Predictor, Decision Tree Predictor), as Figure 13 shows. At the end of this step, a table containing all bounding boxes of one sub-video results.
Figure 13: Classification needs user interaction and computation using prediction algorithms.
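We do not spell out the exact VAT feature set feeding those predictors; as an assumption, the following sketch shows the kind of geometric features (width, height, aspect ratio, area) a bounding box could contribute to such a training table.

import java.util.Arrays;

public class BoxFeatures {
    record Box(int x, int y, int width, int height) {}

    // Geometric features for one bounding box; illustrative choice only.
    static double[] features(Box b) {
        double aspect = b.width() / (double) b.height();
        double area = (double) b.width() * b.height();
        return new double[] { b.width(), b.height(), aspect, area };
    }

    public static void main(String[] args) {
        // a tall, narrow box: plausibly class "human"
        System.out.println(Arrays.toString(features(new Box(10, 20, 18, 42))));
    }
}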
3. Determination of Suspicious Events
Once the patterns have been
recognized, the suspicious events can be reviewed by visualization of the
patterns, as shown in Figure 14.
Figure 14: Patterns can be verified manually and marked for export to a result table.
As a result, the most relevant pattern is
This also means that two persons may walk down a street together (implying a previous meeting).
The pattern:
needs to be redefined for another run, since too many events were found.
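The pattern definitions themselves are not reproduced above; as an illustration, the following sketch detects a generic "two persons meet" event under assumed parameters: two human boxes whose centers stay within maxDist of each other for at least minFrames consecutive frames.

import java.util.ArrayList;
import java.util.List;

public class MeetingPattern {
    record Box(double cx, double cy) {}   // bounding-box center

    // humansPerFrame: centers of all "human" boxes, frame by frame.
    // Returns [startFrame, endFrame] pairs of candidate meeting events.
    static List<int[]> meetings(List<List<Box>> humansPerFrame, double maxDist, int minFrames) {
        List<int[]> events = new ArrayList<>();
        int runStart = -1;
        for (int f = 0; f < humansPerFrame.size(); f++) {
            boolean close = anyPairClose(humansPerFrame.get(f), maxDist);
            if (close && runStart < 0) runStart = f;
            if (!close && runStart >= 0) {
                if (f - runStart >= minFrames) events.add(new int[] { runStart, f - 1 });
                runStart = -1;
            }
        }
        if (runStart >= 0 && humansPerFrame.size() - runStart >= minFrames)
            events.add(new int[] { runStart, humansPerFrame.size() - 1 });
        return events;
    }

    static boolean anyPairClose(List<Box> boxes, double maxDist) {
        for (int i = 0; i < boxes.size(); i++)
            for (int j = i + 1; j < boxes.size(); j++)
                if (Math.hypot(boxes.get(i).cx() - boxes.get(j).cx(),
                               boxes.get(i).cy() - boxes.get(j).cy()) <= maxDist)
                    return true;
        return false;
    }
}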
4.2 Performance Comparison of Automatic and Interactive Parts
Performance is assessed for the interactive and automatic parts of the process chain. Processing times for the user as well as for the hardware (server, PC) are listed separately in Table 2.
No. | Process step | Time in min (user) | Time in min (HW)
1 | Frame extraction | 0 | 180 - 360
2 | Set up thresholds | 5 - 15 | 5 - 15
3 | Determination of bounding boxes | 0 | 180 - 240
4 | Classifying a subset of bounding boxes | 5 - 15 | 5 - 15
5 | Filtering bounding boxes | <1 | <1
6 | Visualization | 0 | <1
7 | Pattern recognition | 0 | <1
8 | Pattern recognition review | 5 - 30 | 0

Table 2: Comparison of user and hardware processing times for video 1.
Table 3 shows the reduction of data for video 1. The final relevant data was reduced to 0.008% of the potentially relevant data.
No. | Process step | Table rows (input) | Table rows (output)
1 | Frame extraction | 0 | 0
2 | Set up thresholds | 0 | 0
3 | Determination of bounding boxes | 0 | 143528
4 | Classifying a subset of bounding boxes | 143528 | 143528
5 | Filtering bounding boxes | 143528 | 93865
6 | Visualization | 93865 | 93865
7 | Pattern recognition | 93865 | 3859
8 | Pattern recognition review | 3859 | 12

Table 3: Data reduction of video 1 leads to the relevant events.
Compared to the complete video time (4 hours), the user interaction takes between 25 and 70 minutes. The VAT tool enables an analyst to focus her/his attention on a limited number of automatically preselected events, whereas it would otherwise be very difficult and exhausting to attentively watch whole videos of several hours' duration.